[Enhancement] Implement metrics reporting for MemTrackerManager #68170

Open
arin-mirza wants to merge 9 commits into StarRocks:main from arin-mirza:add-metric-registry-to-mem-tracker-manager

Conversation

@arin-mirza
Contributor

@arin-mirza arin-mirza commented Jan 20, 2026

Why I'm doing:

There is currently no backend metrics reporting for memory pools.

I previously tried to add them by extending the workgroup metrics, but this turned out to be an incorrect approach:

What I'm doing:

This PR implements metric reporting for MemTrackerManager and adds the following new metrics:

  • mem_pool_mem_limit_bytes
  • mem_pool_mem_usage_bytes
  • mem_pool_mem_usage_ratio
  • mem_pool_workgroup_count
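
As a rough illustration of how the four values relate, the sketch below derives them from a per-pool snapshot. The `PoolSnapshot` and `PoolMetricValues` types and the `to_metric_values` helper are hypothetical names, and defining the ratio as usage divided by limit (0 when no limit is set) is an assumption, not something taken from the PR's code.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical snapshot of one memory pool; not a StarRocks type.
struct PoolSnapshot {
    int64_t limit_bytes;      // configured memory limit of the pool
    int64_t usage_bytes;      // memory currently consumed by the pool
    int64_t workgroup_count;  // number of workgroups attached to the pool
};

// One field per metric listed above (names without any exporter prefix).
struct PoolMetricValues {
    int64_t mem_limit_bytes;
    int64_t mem_usage_bytes;
    double mem_usage_ratio;
    int64_t workgroup_count;
};

PoolMetricValues to_metric_values(const PoolSnapshot& snapshot) {
    assert(snapshot.usage_bytes >= 0);
    PoolMetricValues values;
    values.mem_limit_bytes = snapshot.limit_bytes;
    values.mem_usage_bytes = snapshot.usage_bytes;
    // Assumption: the ratio is usage / limit, and 0 when the pool has no positive limit.
    values.mem_usage_ratio = snapshot.limit_bytes > 0
            ? static_cast<double>(snapshot.usage_bytes) / static_cast<double>(snapshot.limit_bytes)
            : 0.0;
    values.workgroup_count = snapshot.workgroup_count;
    return values;
}
```

Guarding the division keeps the ratio well defined for pools without a positive limit.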

The implementation follows the same locking structure that is present in WorkGroupManager.

  • It was necessary to add a new mutex for MemTrackerManager because the update_metrics callback hook passed to MetricRegistry needs to be a closure which captures a write lock.
  • The unlocked gap inside the add_metrics method is unavoidable: it is needed to prevent an AB-BA deadlock scenario with the metrics collector.
  • Metrics entries are never deleted as it would complicate the thread synchronization even further. This is also the case for the existing implementation in WorkGroupManager.
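
The locking points above can be sketched as follows. This is a minimal model under stated assumptions, not the PR's implementation: `SketchRegistry`, `SketchManager`, and the metric key format are hypothetical, and a plain `std::mutex` stands in for whatever lock types the real classes use. The collector takes the registry lock (A) and then, through the hook, the manager lock (B); `add_pool` therefore releases B before it touches the registry, so no thread ever holds B while waiting for A.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <functional>
#include <map>
#include <memory>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

using Gauge = std::atomic<int64_t>;

// Hypothetical stand-in for a metric registry; this is not the StarRocks MetricRegistry API.
class SketchRegistry {
public:
    // Returns the gauge stored under `name`; the first registration wins.
    std::shared_ptr<Gauge> register_gauge(const std::string& name, std::shared_ptr<Gauge> gauge) {
        std::lock_guard<std::mutex> guard(_lock);
        return _gauges.emplace(name, std::move(gauge)).first->second;
    }

    void register_hook(std::function<void()> hook) {
        std::lock_guard<std::mutex> guard(_lock);
        _hooks.push_back(std::move(hook));
    }

    // Collector path: registry lock first (A), then each hook takes the manager lock (B).
    int64_t collect(const std::string& name) {
        std::lock_guard<std::mutex> guard(_lock);
        for (auto& hook : _hooks) hook();
        auto it = _gauges.find(name);
        return it == _gauges.end() ? -1 : it->second->load();
    }

private:
    std::mutex _lock;
    std::map<std::string, std::shared_ptr<Gauge>> _gauges;
    std::vector<std::function<void()>> _hooks;
};

// Hypothetical manager; it must outlive any collect() call because the hook captures `this`.
class SketchManager {
public:
    explicit SketchManager(SketchRegistry* registry) : _registry(registry) {
        assert(_registry != nullptr);
        _registry->register_hook([this] {
            std::lock_guard<std::mutex> guard(_lock);
            for (auto& entry : _metrics) {
                auto it = _limits.find(entry.first);
                // Pools that no longer exist keep their entry and report 0.
                entry.second->store(it == _limits.end() ? 0 : it->second);
            }
        });
    }

    void add_pool(const std::string& pool, int64_t limit_bytes) {
        std::unique_lock<std::mutex> lock(_lock);
        _limits[pool] = limit_bytes;
        if (_metrics.count(pool) != 0) return;
        // Unlocked gap: holding B while acquiring A would invert the collector's lock order.
        lock.unlock();
        auto gauge = _registry->register_gauge("mem_pool_mem_limit_bytes{" + pool + "}",
                                               std::make_shared<Gauge>(0));
        lock.lock();
        // A concurrent add_pool may have run during the gap; emplace keeps the first entry.
        _metrics.emplace(pool, std::move(gauge));
    }

    void remove_pool(const std::string& pool) {
        std::lock_guard<std::mutex> guard(_lock);
        _limits.erase(pool);
    }

private:
    SketchRegistry* _registry;
    std::mutex _lock;
    std::map<std::string, int64_t> _limits;
    std::map<std::string, std::shared_ptr<Gauge>> _metrics;
};
```

Because the registry returns the gauge that is actually stored, two threads that race through the gap end up sharing one gauge instead of diverging.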

Minor: Changed list_mem_trackers() method to not return the default memory pool name.

Tests and Docs

  • I did not add any test cases as there were not any for workgroup metrics either. Let me know if this is necessary.
  • I did not verify that the new metrics are being reported correctly by building and running the starrocks fe/be, as I am currently unable to build the engine locally.
  • I updated the user documentation.
    • I am not a Chinese or Japanese speaker so I used AI for the translation. I would appreciate it if a native speaker could review my additions to ensure the tone is correct. :)

What type of PR is this:

  • BugFix
  • Feature
  • Enhancement
  • Refactor
  • UT
  • Doc
  • Tool

Does this PR entail a change in behavior?

  • Yes, this PR will result in a change in behavior.
  • No, this PR will not result in a change in behavior.

If yes, please specify the type of change:

  • Interface/UI changes: syntax, type conversion, expression evaluation, display information
  • Parameter changes: default values, similar parameters but with different default values
  • Policy changes: use new policy to replace old one, functionality automatically enabled
  • Feature removed
  • Miscellaneous: upgrade & downgrade compatibility, etc.

Checklist:

  • I have added test cases for my bug fix or my new feature
  • This pr needs user documentation (for new or modified features or behaviors)
    • I have added documentation for my new feature or new function
    • This pr needs auto generate documentation
  • This is a backport pr

Bugfix cherry-pick branch check:

  • I have checked the version labels which the pr will be auto-backported to the target branch
    • 4.1
    • 4.0
    • 3.5
    • 3.4

@arin-mirza arin-mirza requested a review from a team as a code owner January 20, 2026 11:06
@github-actions github-actions bot added the behavior_changed and documentation labels Jan 20, 2026
@StarRocks-Reviewer

@cursor review

@arin-mirza
Contributor Author

@alvin-celerdata @kevincai I closed the previous PR where you were reviewers, can I get a review for this one, please? :)

@alvin-celerdata
Contributor

@codex review

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f7a13d9f6d

@StarRocks-Reviewer

@cursor review

@cursor cursor bot left a comment

✅ Bugbot reviewed your changes and found no new issues!

@arin-mirza
Contributor Author

arin-mirza commented Jan 28, 2026

@alvin-celerdata Can this PR be merged? Is there anything else that needs to be done?

Contributor

Copilot AI left a comment

Pull request overview

This PR implements metrics reporting for MemTrackerManager to expose memory pool usage statistics. Previously, there were no backend metrics for memory pools. The implementation adds four new metrics: mem_pool_mem_limit_bytes, mem_pool_mem_usage_bytes, mem_pool_mem_usage_ratio, and mem_pool_workgroup_count. The implementation follows the established locking and metrics registration patterns from WorkGroupManager to avoid deadlocks with the metrics collector.

Changes:

  • Added metrics infrastructure to MemTrackerManager with thread-safe registration and update mechanisms
  • Updated list_mem_trackers() to exclude the default memory pool, improving consistency
  • Added documentation in English, Chinese, and Japanese for the new metrics

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| be/src/exec/workgroup/mem_tracker_manager.h | Added MemTrackerMetrics struct, metrics-related private methods, and mutex for thread synchronization |
| be/src/exec/workgroup/mem_tracker_manager.cpp | Implemented metrics registration, update logic, and modified list_mem_trackers() to exclude default pool |
| be/test/exec/workgroup/work_group_manager_test.cpp | Updated test expectations to reflect that default memory pool is no longer included in the list, removed unnecessary sleep |
| docs/en/administration/management/monitoring/metrics.md | Added English documentation for the four new metrics |
| docs/zh/administration/management/monitoring/metrics.md | Added Chinese documentation for the four new metrics |
| docs/ja/administration/management/monitoring/metrics.md | Added Japanese documentation for the four new metrics |

@kevincai kevincai requested a review from trueeyu February 13, 2026 03:36
- Unit: -
- Description: Ratio of internal table scan thread time slices used by each resource group to the total used by all resource groups. This is an average value over the time interval between two metric retrievals.

### mem_pool_mem_limit_bytes
Contributor

The actual metric name will be prefixed with starrocks_be_, so to the end user it is actually starrocks_be_mem_pool_mem_limit_bytes. We should use the final name here, since this doc is for end users.

Contributor Author

@arin-mirza arin-mirza Feb 13, 2026

I updated the documentation to use starrocks_be_ prefix.


I think the documentation is inconsistent in this regard.

Most resource-group-related metrics are documented twice: one entry with the starrocks_be prefix and one without. I also checked the Datadog metrics to see whether they are actually reported twice, but they are not; they are always reported under a name with the starrocks_be prefix.

In metrics.md:

| Without prefix | With prefix | In Datadog |
| --- | --- | --- |
| resource_group_mem_limit_bytes | starrocks_be_resource_group_mem_limit_bytes | with prefix |
| resource_group_mem_inuse_bytes | - | with prefix |
| resource_group_cpu_limit_ratio | starrocks_be_resource_group_cpu_limit_ratio | with prefix |
| resource_group_cpu_use_ratio | starrocks_be_resource_group_cpu_use_ratio | with prefix |
| resource_group_scan_use_ratio | - | with prefix |
| resource_group_inuse_cpu_cores | - | with prefix |
| resource_group_connector_scan_use_ratio | - | with prefix |
| - | starrocks_be_resource_group_mem_allocated_bytes | does not exist |

I believe all metrics above should only have one entry with the starrocks_be prefix.

Moreover, I could not find starrocks_be_resource_group_mem_allocated_bytes in Datadog. This metric was renamed to resource_group_mem_inuse_bytes in 6204611 on 2022-06-29. The entry should be removed from the user documentation if safe to do so.

Contributor

Yeah, I think the doc is messed up somehow. The actual metric name produced via /metrics should be the canonical one in this doc.

Just keep the newly added ones prefixed with starrocks_be_; the other inconsistent ones will be fixed in a dedicated PR.
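
For readers following the naming point, here is a tiny sketch of how a registered name maps to the name an end user sees. The `starrocks_be_` prefix is the one stated in this thread; the helper name `exposed_metric_name` is hypothetical, and the real exporter may compose names differently.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: maps a name as registered in the backend code to the
// name that, according to this thread, end users see on /metrics.
std::string exposed_metric_name(const std::string& registered_name) {
    assert(!registered_name.empty());
    static const std::string kBackendPrefix = "starrocks_be_";
    return kBackendPrefix + registered_name;
}
```

With this mapping, the documentation entry for `mem_pool_mem_limit_bytes` would be listed as `starrocks_be_mem_pool_mem_limit_bytes`.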

metrics->workgroup_count->set_value(child_count);
}
} else {
// Metrics entries for deleted shared_mem_trackers are never deleted, but simply set to 0.
Contributor

In a long-running system with frequent workgroup creation and deletion, will these garbage metrics accumulate, occupying memory and causing long, useless serialization on the /metrics interface?

Contributor Author

Not really.

Workgroups belong to DEFAULT_MEM_POOL by default; these are not tracked in MemTrackerManager, and it does not report mempool statistics for such workgroups. Frequent creation and deletion of workgroups under a memory pool is a special case.

In the general case, we are already paying the price of not deleting the metrics for workgroups. Compare my implementation with _update_metrics_unlocked() method in work_group.cpp:

} else {
    VLOG(2) << "workgroup update_metrics " << name << ", workgroup not exists so cleanup metrics";
    wg_metrics->cpu_limit->set_value(0);
    wg_metrics->inuse_cpu_ratio->set_value(0);
    wg_metrics->inuse_scan_ratio->set_value(0);
    wg_metrics->inuse_connector_scan_ratio->set_value(0);
    wg_metrics->mem_limit->set_value(0);
    wg_metrics->inuse_mem_bytes->set_value(0);
    wg_metrics->connector_scan_mem_bytes->set_value(0);
    wg_metrics->running_queries->set_value(0);
    wg_metrics->total_queries->set_value(0);
    wg_metrics->concurrency_overflow_count->set_value(0);
    wg_metrics->bigquery_count->set_value(0);
    wg_metrics->inuse_cpu_cores->set_value(0);
}

Your concern about garbage metrics accumulating applies here even more. I believe the reason it was implemented this way is that deleting metrics entries makes handling race conditions more complicated and error-prone.

I can change my implementation so that unused metrics are deleted instead of being set to 0. However, I am not sure if it is worth the extra effort as work_group.cpp does not delete its own metrics anyway.

Let me know which one you prefer.
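
To make the two options concrete, the sketch below contrasts them on a plain map. It is a simplified single-threaded model under stated assumptions: `PoolGauges`, `MetricsMap`, `zero_stale`, and `erase_stale` are hypothetical names, and it deliberately leaves out the locking and registry deregistration that make the erase variant harder in practice.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <set>
#include <string>

// Hypothetical per-pool gauge values; not a StarRocks type.
struct PoolGauges {
    int64_t limit_bytes;
    int64_t usage_bytes;
};
using MetricsMap = std::map<std::string, PoolGauges>;

// Option 1: keep stale entries and reset them to 0 (the approach discussed above).
// The map never shrinks, so it grows with the number of distinct pool names ever seen.
std::size_t zero_stale(MetricsMap& metrics, const std::set<std::string>& live) {
    std::size_t zeroed = 0;
    for (auto& entry : metrics) {
        if (live.count(entry.first) == 0) {
            entry.second = PoolGauges{0, 0};
            ++zeroed;
        }
    }
    assert(zeroed <= metrics.size());
    return zeroed;
}

// Option 2: erase stale entries. A real implementation would also have to
// deregister them from the registry without racing the collector.
std::size_t erase_stale(MetricsMap& metrics, const std::set<std::string>& live) {
    std::size_t erased = 0;
    for (auto it = metrics.begin(); it != metrics.end();) {
        if (live.count(it->first) == 0) {
            it = metrics.erase(it);
            ++erased;
        } else {
            ++it;
        }
    }
    return erased;
}
```

The zeroing variant trades a bounded amount of leftover state per deleted name for simpler synchronization, which is the tradeoff this thread is weighing.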

Contributor

OK, I will see whether other reviewers have thoughts on this. I am fine with keeping it as is for now.

@github-actions
Contributor

github-actions bot commented Feb 13, 2026

🌎 Translation Required?

All translation files are up to date.
Great job! No translation actions are required for this PR.

🕒 Last updated: Fri, 13 Feb 2026 15:26:43 GMT

@arin-mirza arin-mirza force-pushed the add-metric-registry-to-mem-tracker-manager branch from 511ea62 to 3671e20 Compare February 13, 2026 14:55
Signed-off-by: arin-mirza <a.mirza@celonis.com>
@arin-mirza arin-mirza force-pushed the add-metric-registry-to-mem-tracker-manager branch from 3671e20 to 3be1477 Compare February 13, 2026 15:04
@arin-mirza
Contributor Author

arin-mirza commented Feb 13, 2026

I rebased onto the latest main.

Signed-off-by: arin-mirza <a.mirza@celonis.com>
@github-actions
Contributor

[Java-Extensions Incremental Coverage Report]

pass : 0 / 0 (0%)

@github-actions
Contributor

[FE Incremental Coverage Report]

pass : 0 / 0 (0%)

@github-actions
Contributor

[BE Incremental Coverage Report]

fail : 41 / 68 (60.29%)

file detail

| path | covered_line | new_line | coverage | not_covered_line_detail |
| --- | --- | --- | --- | --- |
| 🔵 be/src/exec/workgroup/mem_tracker_manager.cpp | 39 | 66 | 59.09% | [120, 144, 145, 146, 147, 149, 150, 152, 153, 154, 155, 156, 158, 159, 161, 162, 164, 165, 169, 170, 172, 173, 175, 176, 178, 179, 183] |
| 🔵 be/src/exec/workgroup/mem_tracker_manager.h | 2 | 2 | 100.00% | [] |

Labels

4.1, docs-maintainer, documentation

5 participants